
Add support for Synthetics Private Locations for DDR in dd-sync-cli #526

Merged: melkouri merged 8 commits into main from malak.elkouri/SYNTH-26118/update-PL-creation-for-ddr on Jun 8, 2026

@melkouri melkouri commented Apr 17, 2026

What does this PR do?

  1. Adds DDR (Disaster Recovery) support to the synthetics_private_locations resource, enabling PL replication from a source org (R1) to a destination org (R2).

Related dogweb PRs:

RFC: https://docs.google.com/document/d/1pxwa2vqa5I_NkhWlhXlcuFsDgSbBVQNdeXFEQ1vzQSM/edit?tab=t.0#heading=h.tmwe8zpx1a8q

  2. Migrates synthetics_tests off skip_resource_mapping=True to use the map_existing_resources infrastructure. Adds a cross-run dedup short-circuit: when a sync run finds a destination test already stamped with a known source_public_id, it adopts that test into state instead of POSTing a duplicate. This guards against orphan accumulation when a prior sync POST succeeded server-side but the client never received the response (504-after-success under destination API load).

Description of the Change

New CLI option: --datadog-host-override

  • Added to constants.py (DD_DATADOG_HOST_OVERRIDE env var), options.py (CLI flag), and configuration.py (threaded to the Configuration dataclass).
  • Optional CNAME override passed to the DDR create endpoint for DNS failover.
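
As a rough illustration of how the flag and the environment variable could feed one optional configuration field, here is a minimal sketch. It uses argparse and a trimmed-down, hypothetical Configuration dataclass purely for illustration; the real options.py and configuration.py are not shown in this PR description and may be wired differently. Only the flag name and the DD_DATADOG_HOST_OVERRIDE variable name come from the description above.

```python
import argparse
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence

# Env var name taken from the PR description.
ENV_VAR = "DD_DATADOG_HOST_OVERRIDE"


@dataclass
class Configuration:
    # Hypothetical stand-in for the real Configuration dataclass.
    datadog_host_override: Optional[str] = None


def build_configuration(argv: Sequence[str], env: Mapping[str, str]) -> Configuration:
    """Resolve the override: the CLI flag wins, the env var is the fallback."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--datadog-host-override", default=env.get(ENV_VAR))
    args = parser.parse_args(list(argv))
    # Treat an empty string the same as "not set".
    return Configuration(datadog_host_override=args.datadog_host_override or None)
```

Passing the environment in as a mapping (rather than reading os.environ inside the function) keeps the precedence rule easy to unit test.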

Updated synthetics_private_locations.py:

  1. excluded_attributes: added ddr_metadata, pl_id, and public_key_test. ddr_metadata is returned on the destination for DDR PLs and must not be diffed against the source. pl_id and public_key_test are returned by the source-side GET (with include_pl_info=true) and are only used at create time, so they must not show up as diffs.

  2. import_resource(): calls GET /api/v1/synthetics/private-locations/{id}?include_pl_info=true so the source state captures pl_id and public_key_test at import time. No extra fetch at sync time.

  3. create_resource(): when a source PL is being created at the destination:

    • Reads pl_id + public_key_test from the source state (already captured at import).
    • Strips null metadata from the request body (DDR endpoint rejects it).
    • Injects ddr_metadata.disaster_recovery with source_pl_id and source_name. The dogweb schema is strict (additionalProperties: false) and accepts only these two fields.
    • Sets test_encryption_public_key to the JSON-stringified public_key_test object ({pem, fingerprint, id}).
    • Optionally sets datadog_host_override.
    • Parses the DDR response and returns resp["private_location"]. When --datadog-host-override is set, it appears at private_location.config.datadogHostOverride and ships in the regular destination state JSON.
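
The create-time transformation described in the steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in synthetics_private_locations.py: the helper name build_ddr_create_body is hypothetical, the sketch assumes the new fields sit at the top level of the request body, that source_name is taken from the source PL's name, and that pl_id and public_key_test are dropped from the body once consumed.

```python
import copy
import json
from typing import Any, Dict, Optional


def build_ddr_create_body(
    source_pl: Dict[str, Any], host_override: Optional[str] = None
) -> Dict[str, Any]:
    """Hypothetical helper: turn a source-state PL into a DDR create body."""
    body = copy.deepcopy(source_pl)  # never mutate the source state

    # Captured at import time via include_pl_info=true.
    pl_id = body.pop("pl_id")
    public_key_test = body.pop("public_key_test")

    # The DDR endpoint rejects a null metadata field, so drop it.
    if body.get("metadata") is None:
        body.pop("metadata", None)

    # Strict schema: only these two fields are accepted.
    body["ddr_metadata"] = {
        "disaster_recovery": {
            "source_pl_id": pl_id,
            "source_name": body["name"],  # assumption: name of the source PL
        }
    }

    # The key object ({pem, fingerprint, id}) is sent as a JSON string.
    body["test_encryption_public_key"] = json.dumps(public_key_test)

    if host_override:
        body["datadog_host_override"] = host_override
    return body
```

Working on a deep copy means the source state still holds pl_id and public_key_test after the call, which matches the "no extra fetch at sync time" goal.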

Updated synthetics_tests.py:

  1. resource_config: replaced skip_resource_mapping=True with resource_mapping_key="metadata.disaster_recovery.source_public_id". That stamp is already injected on every source test by pre_resource_action_hook, so it doubles as the dedup key.

  2. map_existing_resources() override: fetches destination tests via GET /api/v1/synthetics/tests?include_metadata=true and indexes them by source_public_id. The custom override avoids the destination-versions side-effect of the existing get_resources().

  3. create_resource() dedup short-circuit: before issuing a new POST, checks if a destination test already carries this source_public_id. If yes, adopts it into destination state and calls update_resource() instead of POSTing. Prevents new duplicates when re-syncing after a prior orphan-creating run.
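
To make the dedup flow concrete, here is a minimal sketch of indexing destination tests by the mapping key and adopting a match instead of creating a new test. The function names, the post/update callables, and the use of public_id as the destination identifier are assumptions for illustration; the actual synthetics_tests.py implementation is not reproduced here.

```python
from typing import Any, Callable, Dict, Iterable, Optional

# Mapping key from the PR description.
MAPPING_KEY = "metadata.disaster_recovery.source_public_id"


def get_path(obj: Any, dotted: str) -> Optional[Any]:
    """Walk a dotted path through nested dicts; return None if any hop is missing."""
    for part in dotted.split("."):
        if not isinstance(obj, dict):
            return None
        obj = obj.get(part)
    return obj


def index_by_source_public_id(tests: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Index destination tests by their stamp; unstamped tests are ignored."""
    index: Dict[str, Dict[str, Any]] = {}
    for test in tests:
        key = get_path(test, MAPPING_KEY)
        if key is not None:
            index.setdefault(key, test)  # assumption: first match wins
    return index


def create_or_adopt(
    source_test: Dict[str, Any],
    existing: Dict[str, Dict[str, Any]],
    post: Callable[[Dict[str, Any]], Any],
    update: Callable[[str, Dict[str, Any]], Any],
) -> Any:
    """Update the already-stamped destination test if present, else POST."""
    key = get_path(source_test, MAPPING_KEY)
    match = existing.get(key) if key is not None else None
    if match is not None:
        return update(match["public_id"], source_test)
    return post(source_test)
```

Because the index is built once before the run, it only covers orphans left by earlier runs; duplicates created by retries within the same POST are not visible to it.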

Test fixture update: moved synthetics_tests from OPT_OUT_RESOURCES to MAPPING_RESOURCES in tests/unit/test_map_existing_resources.py.

Operational guidance

When running against staging (app.datad0g.com), throttle concurrency to avoid McNulty pool exhaustion (512 status codes; see Confluence):

--max-workers 20 --http-client-retry-timeout 180

The default --max-workers 100 saturates dogweb's gunicorn pool on staging, producing 512s and 504s. The 504s are particularly bad on POST because dogweb may have written the resource server-side before timing out, and the client's retry then duplicates it. Throttling keeps the pool happy and eliminates this in-POST duplication path. Prod has more headroom and can usually run at the default.

Known limitations

In-POST retry duplication: if a single POST attempt 504s and the client retries, dogweb may have already created the resource. Each retry creates another server-side row. The throttling guidance above is the operational mitigation; a proper fix (skipping 504 retries on POST in custom_client.py) is out of scope for this PR.
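
For context on what the out-of-scope fix might look like, here is a hedged sketch of a retry predicate that declines to retry a 504 on POST. The function name and the set of retryable status codes are illustrative assumptions; the retry logic in custom_client.py is not shown in this PR and may differ.

```python
# Assumption: an illustrative set of statuses worth retrying.
RETRYABLE_STATUSES = frozenset({429, 502, 503, 504})

# Methods whose repeated execution can create duplicate resources.
NON_IDEMPOTENT_METHODS = frozenset({"POST"})


def should_retry(method: str, status: int) -> bool:
    """Return True if a request with this method and status may be retried."""
    if status not in RETRYABLE_STATUSES:
        return False
    # A 504 on POST may mean the server already wrote the resource,
    # so retrying risks creating another copy.
    if status == 504 and method.upper() in NON_IDEMPOTENT_METHODS:
        return False
    return True
```

With a rule like this, a POST that times out would surface as a failure and be reconciled on the next run by the cross-run dedup, rather than being retried blindly.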

Verification Process

Tested end-to-end on staging (app.datad0g.com) between two orgs with --max-workers 20 --http-client-retry-timeout 180:

  1. Created PLs and tests in source org (R1).
  2. Ran import: confirmed source state captured pl_id and public_key_test via include_pl_info=true, and that every R1 test landed in resources/source/synthetics_tests.json.
  3. Ran sync: confirmed PLs created via the DDR endpoint with the right schema, tests created in R2 with metadata.disaster_recovery stamps, and destination state matches R2 counts after throttled runs.
  4. Verified end-to-end DDR replication: manually unpaused one of the replicated tests in R2 and confirmed synthetics_private_locations_check_assignment populated and the test began running on the replicated PL.
  5. Ran a second sync after dropping a destination state entry, and confirmed the dedup short-circuit adopts the existing R2 test instead of creating a duplicate.

Follow-up

A separate PR will revisit the reset command semantics ("delete only sync-cli-managed resources" vs current "delete everything in R2"). That conversation is out of scope here.

Release Notes

  • Added DDR (Disaster Recovery) support for synthetics_private_locations.
  • New --datadog-host-override CLI option for optional CNAME override during PL replication.
  • Migrated synthetics_tests to map_existing_resources with cross-run dedup against metadata.disaster_recovery.source_public_id. Re-running sync after a degraded run no longer accumulates duplicate tests in the destination org.

melkouri and others added 2 commits May 27, 2026 13:25
Resolves conflict in synthetics_private_locations.py: keep DDR
include_pl_info=true query param while adopting main's new
state.set_source() API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…un dedup

Previously synthetics_tests had skip_resource_mapping=True, so the sync
had no way to detect a destination test that already existed when state
didn't know about it. Under degraded destination APIs (504/5xx storms
during McNulty pool exhaustion on staging), repeated sync runs were
accumulating duplicate tests in R2: a prior sync's POST succeeded
server-side, the client exhausted retries and never persisted the
destination id, and the next sync's create branch POSTed again.

This migration:
- Replaces skip_resource_mapping with
  resource_mapping_key="metadata.disaster_recovery.source_public_id",
  the stamp pre_resource_action_hook already writes on every source
  test before create/update.
- Adds a map_existing_resources() override that LISTs destination tests
  with include_metadata=true and indexes them by source_public_id,
  without the destination-versions side-effect that get_resources has.
- Adds a short-circuit at the top of create_resource: if R2 already has
  a test stamped with this source_public_id, adopt it into state and
  call update_resource instead of POSTing a new copy.

This does not address in-POST retry duplications (those happen mid-POST,
after the existing_resources_map is built and frozen). Operational
mitigation for staging-degraded conditions:
  --max-workers 20 --http-client-retry-timeout 180

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@melkouri melkouri marked this pull request as ready for review June 3, 2026 12:24
@melkouri melkouri requested a review from a team as a code owner June 3, 2026 12:24
melkouri commented Jun 3, 2026

TODO: Add a guideline on how to sync Synthetics Private Locations for customers.

@melkouri melkouri changed the title from "Adapt sync-cli to support synthetics PLs replication for DDR" to "Add support for Synthetics Private Locations for DDR in dd-sync-cli" Jun 3, 2026
melkouri and others added 2 commits June 4, 2026 10:42
Per reviewer feedback (Michael Richey, Ron Hay): the "delete only sync-cli-
managed resources" change to reset is a broader semantic discussion that
deserves its own PR and possibly its own command name. Moving the behavior
change out of this PR so the DDR PL replication can be reviewed in isolation.

The discussion will continue in a follow-up PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- PL cassettes (test_synthetics_private_locations and test_cli):
  * Append ?include_pl_info=true to source PL GET URLs.
  * Add pl_id + public_key_test to GET responses (consumed by
    create_resource for ddr_metadata + test_encryption_public_key).
  * Update POST bodies to include ddr_metadata.disaster_recovery and
    test_encryption_public_key as JSON-stringified payload.

- synthetics_tests cassettes: prepend recorded GET LIST interactions
  against the destination for the new map_existing_resources call,
  with empty {"tests": []} response (no orphans to dedupe in the
  test scenario).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@melkouri melkouri merged commit 2c1e170 into main Jun 8, 2026
11 checks passed
@melkouri melkouri deleted the malak.elkouri/SYNTH-26118/update-PL-creation-for-ddr branch June 8, 2026 08:04